[Feature] Atomic Reduction Operations and Vectorization Enhancement#1676
LeiWang1999 merged 15 commits into tile-ai:main
Conversation
…d functionality

- Added support for atomic max and min operations with corresponding classes and methods.
- Introduced vectorization capabilities for atomic add operations based on data type and target architecture.
- Refactored atomic add intrinsic calls to improve clarity and consistency.
- Enhanced layout inference for atomic operations and integrated new utility functions for target architecture checks.
- Updated tests to validate the new atomic operations and their vectorization behavior.
- Cleaned up whitespace and formatting in atomic add and reduce implementations for better code clarity.
- Enhanced comments in CUDA code generation for atomic operations to improve understanding.
- Updated function signatures and layout definitions for consistency across atomic operation files.
- Ensured proper alignment of code for better maintainability and readability.
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run … We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
Caution: Review failed. The pull request is closed.

Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Refactors atomic ops and vectorization: adds an AtomicOpBaseNode and AtomicMax/AtomicMin, reworks AtomicAdd lowering to use ParallelOp + multi-level layout inference, replaces planner-driven atomic_add vectorization with an inline rewriter, threads LayoutMap through vectorize/lowering, adds TL builtins and CUDA/HIP codegen, and updates tests.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant User as User Code
participant Lang as tilelang (atomic.py)
participant TileOp as TL TileOp / ElemOp
participant Infer as ParallelOp InferLayout
participant Lower as LowerParallelLoop
participant Vect as AtomicAddVectorizeRewriter
participant Codegen as CUDA/HIP Codegen
User->>Lang: emit atomic_* (scalar or tensor)
Lang->>Lang: choose scalar vs tile-region path
alt Tile-region
Lang->>TileOp: create tl.tileop.atomic*
TileOp->>Infer: call InferLayout via ParallelOp pipeline
Infer->>Lower: call LowerParallelLoop (provides layout_map)
else Scalar
Lang->>Codegen: emit tl.atomic_*_elem_op call
end
Lower->>Vect: pass lowered loop IR (may contain atomic_add_elem_op)
Vect->>Vect: detect loop extent & decide vector width (dtype, SM)
alt Vectorizable
Vect->>TileOp: rewrite atomic_add -> atomic_addx2/x4
else Non-vectorizable
Vect->>TileOp: leave atomic_add_elem_op
end
TileOp->>Codegen: emit target intrinsics (handle memory_order)
Codegen->>User: kernel source / binary
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches

📜 Recent review details

Configuration used: defaults
Review profile: CHILL
Plan: Pro

📒 Files selected for processing (1)

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Pull request overview
This PR introduces atomic reduction operations (atomic_max, atomic_min) and enhances atomic operations with automatic vectorization support, along with significant code architecture improvements using the CRTP pattern.
Changes:
- Added new atomic reduction operations (atomic_max, atomic_min) with both element-wise and tile-region-based modes
- Implemented automatic vectorization of atomic_add_elem_op to vectorized variants (atomic_addx2_elem_op, atomic_addx4_elem_op)
- Refactored the atomic operation architecture using the CRTP pattern, with an AtomicOpBaseNode<T> base class to reduce code duplication (see the sketch below)
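For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of the CRTP dispatch described above. It is illustrative only: the names mirror the PR (AtomicOpBaseNode, GetElemOp), but the simplified signatures are hypothetical; the real base class lives in src/op/atomic_reduce.h.

```cpp
#include <cstdio>

// Simplified CRTP base: the derived class supplies its element-wise intrinsic
// via GetElemOp(); shared logic calls it without virtual dispatch.
template <typename Derived>
struct AtomicOpBase {
  const char* ElemOpName() const {
    return static_cast<const Derived*>(this)->GetElemOp();
  }
};

struct AtomicMaxOp : AtomicOpBase<AtomicMaxOp> {
  const char* GetElemOp() const { return "tl.atomic_max_elem_op"; }
};

struct AtomicMinOp : AtomicOpBase<AtomicMinOp> {
  const char* GetElemOp() const { return "tl.atomic_min_elem_op"; }
};

int main() {
  AtomicMaxOp mx;
  AtomicMinOp mn;
  std::printf("%s\n%s\n", mx.ElemOpName(), mn.ElemOpName());
  return 0;
}
```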
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tilelang/language/atomic.py | Added atomic_max and atomic_min Python APIs with scalar and tile-region support; migrated from call_extern to call_intrin |
| src/op/atomic_reduce.h | Introduced AtomicOpBaseNode CRTP base class and AtomicMax/AtomicMin node classes |
| src/op/atomic_reduce.cc | Implementation of atomic reduction operations with shared base functionality |
| src/op/atomic_add.h | Refactored AtomicAddNode to inherit from AtomicOpBaseNode |
| src/op/atomic_add.cc | Simplified atomic add implementation using base class methods |
| src/op/builtin.h | Added new atomic operation intrinsic declarations |
| src/op/builtin.cc | Registered new atomic operation intrinsics |
| src/transform/atomicadd_vectorize.h | Simplified header with new vectorization API |
| src/transform/atomicadd_vectorize.cc | Complete rewrite of atomic add vectorization using vectorized loop detection |
| src/transform/loop_vectorize.cc | Added vectorization planning support for atomic_add_elem_op |
| src/transform/loop_partition.cc | Integrated atomic add vectorization pass |
| src/target/codegen_cuda.cc | Added CUDA codegen for all new atomic intrinsics |
| src/target/utils.h | Added TargetHasSMVersionGE utility function |
| src/target/utils.cc | Implementation of SM version check utility |
| src/op/parallel.cc | Refactored to use IsSharedBuffer/IsGlobalBuffer utilities |
| testing/python/language/test_tilelang_language_atomic.py | Added comprehensive tests for tile-level atomic max/min operations |
Comments suppressed due to low confidence (1)
testing/python/language/test_tilelang_language_atomic.py:29
- The print statement was added for debugging purposes. In production test code, consider removing this or wrapping it in a debug flag to avoid cluttering test output.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/target/utils.cc (1)
63-63: Pre-existing issue: bitwise AND instead of logical AND.

Line 63 uses & (bitwise AND) instead of && (logical AND):

```cpp
return arch >= 100 & arch <= 110;
```

Because the relational operators bind tighter than &, this parses as (arch >= 100) & (arch <= 110), so the result happens to be correct; however, it relies on bitwise AND of bool operands and skips short-circuit evaluation. This is outside the scope of the current PR changes, but && is the idiomatic fix.

Proposed fix

```diff
-  return arch >= 100 & arch <= 110;
+  return arch >= 100 && arch <= 110;
```
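A quick standalone check of that parse (illustrative snippet, not taken from the PR):

```cpp
#include <cassert>

int main() {
  int arch = 105;
  // Relational operators bind tighter than '&', so the original line parses
  // as the parenthesized form below; both agree with the '&&' version, but
  // only '&&' short-circuits.
  bool as_written = arch >= 100 & arch <= 110;
  bool parenthesized = (arch >= 100) & (arch <= 110);
  bool logical = arch >= 100 && arch <= 110;
  assert(as_written == parenthesized && parenthesized == logical);
  return 0;
}
```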
🤖 Fix all issues with AI agents
In `@src/op/atomic_add.h`:
- Around line 69-70: The declaration of LowerTMA is unused: either remove the
unused method declaration Stmt LowerTMA(const LowerArgs &T, arith::Analyzer
*analyzer) const from the class, or extract the TMA-specific lowering block
currently embedded in Lower() into a new method named LowerTMA and have Lower()
call LowerTMA; ensure the signature matches the declaration and move all
TMA-related variables and logic from Lower() (the code between the TMA handling
boundaries) into LowerTMA so there are no duplicate implementations or dead
declarations.
In `@src/op/atomic_reduce.cc`:
- Around line 159-162: The scalar fast-path incorrectly uses a single index {0}
for BufferLoad/BufferStore even when the buffers are multi-dimensional; update
the branch where is_scalar is true to build the full index list instead of
hardcoding {0} — e.g., call the existing MakeIndices helper (or construct a
vector of zeros of length dst->ndim using ranges[i]->min) to produce one index
per dimension and pass that to BufferLoad(src, indices) and BufferStore(dst,
value, indices), leaving the surrounding For(Var("i"), ...) and returns intact.
In `@src/transform/atomicadd_vectorize.cc`:
- Around line 139-146: The vectorized atomic call currently drops annotations
(e.g., memory_order) from the original node; preserve node->annotations when
constructing the vectorized Call returned by GetVectorizedAtomicOp(vector_size_)
so the memory ordering and other metadata are carried over (i.e., include
node->annotations on the Call that uses addr_dst and addr_src and
node->dtype/GetVectorizedAtomicOp).
In `@src/transform/loop_vectorize.cc`:
- Around line 151-182: The code incorrectly calls Downcast<PrimExpr>(node) with
a raw const CallNode*; replace those calls with ffi::GetRef<PrimExpr>(node) so
the raw pointer is converted to a proper ObjectRef/smart pointer before
returning; update both occurrences (the return in the atomic_add_elem_op branch
and the return in the "Other calls" branch) to use ffi::GetRef<PrimExpr>(node)
consistent with the pattern used elsewhere (e.g., ffi::GetRef<Stmt>(node)) and
leave the surrounding logic (vector_size_ updates and early returns) unchanged.
In `@tilelang/language/atomic.py`:
- Line 376: The return_type determination in atomic_addx4 uses a string-based
check ("float" in str(dst.dtype).lower()) which is inconsistent with
atomic_addx2; update atomic_addx4 to use the same dtype-based check as
atomic_addx2—inspect dst.dtype directly to decide float vs handle (i.e., mirror
the exact condition used in atomic_addx2) so return_type is set consistently for
float and non-float types.
- Around line 375-377: The atomic_addx4 implementation ignores the return_prev
flag because atomic_addx4_op is set to the same op in both branches; update the
selection in atomic_addx4 (variable atomic_addx4_op) to pick the return-value
variant when return_prev is true (e.g.
op.Op.get("tl.atomic_addx4_elem_op_return") or whatever naming is used for
return-value ops) and use the existing non-return op otherwise, and if a
return-value variant is not present, raise NotImplementedError to make the
limitation explicit.
- Around line 337-339: The code in atomic_addx2 and atomic_addx4 ignores the
return_prev flag because atomic_addx2_op/atomic_addx4_op selection always picks
the same op; update the selection to mirror atomic_add by choosing
"tl.atomic_addx2_ret_elem_op" / "tl.atomic_addx4_ret_elem_op" when return_prev
is True and the non-_ret variants otherwise, but since those _ret ops are not
declared you must either (A) add the corresponding builtin op
declarations/definitions (tl.atomic_addx2_ret_elem_op and
tl.atomic_addx4_ret_elem_op) to the backend builtins (src/op/builtin.h and
src/op/builtin.cc) or (B) follow the atomic_add tile-region pattern and raise
NotImplementedError when return_prev is True in atomic_addx2 and atomic_addx4;
adjust return_type and T.call_intrin usage accordingly and reference the symbols
atomic_addx2_op, atomic_addx4_op, atomic_addx2_ret_elem_op,
atomic_addx4_ret_elem_op, and the functions atomic_addx2 / atomic_addx4 when
making the change.
🧹 Nitpick comments (3)
tilelang/language/atomic.py (1)
60-68: Consider extracting the duplicated get_extent helper.

The get_extent function is defined identically in atomic_max, atomic_min, and atomic_add. Consider extracting it to a module-level helper or importing from copy_op.py, where a similar helper already exists (see tilelang/language/copy_op.py lines 63-76).

♻️ Suggested refactor

```diff
+def _get_extent(data):
+    """Return the inferred extent (shape) of a buffer-like object."""
+    if isinstance(data, Var) and T.has_let_value(data):
+        data = T.get_let_value(data)
+    if isinstance(data, Buffer):
+        return data.shape
+    elif isinstance(data, BufferRegion):
+        return [x.extent for x in data.region]
+    else:
+        return None
+
+
 def atomic_max(dst: Buffer, value: PrimExpr, memory_order: str | None = None, return_prev: bool = False) -> PrimExpr:
     # ... docstring ...
-
-    def get_extent(data):
-        if isinstance(data, Var) and T.has_let_value(data):
-            data = T.get_let_value(data)
-        if isinstance(data, Buffer):
-            return data.shape
-        elif isinstance(data, BufferRegion):
-            return [x.extent for x in data.region]
-        else:
-            return None
-
-    src_extent = get_extent(value)
-    dst_extent = get_extent(dst)
+    src_extent = _get_extent(value)
+    dst_extent = _get_extent(dst)
```

Apply the same pattern to atomic_min and atomic_add.

Also applies to: 147-155, 239-257
testing/python/language/test_tilelang_language_atomic.py (1)

29-29: Consider removing debug print statements before merge.

While useful during development, print(kernel.get_kernel_source()) statements can clutter CI logs. Consider removing them or gating them behind a verbose flag if they're not needed for the final tests.

src/op/atomic_add.cc (1)
90-100: Consider unifying vectorization-length logic with atomicadd_vectorize.cc.

This logic duplicates AtomicAddVectorizeRewriter::GetMaxVectorSize in src/transform/atomicadd_vectorize.cc (lines 64-73). Consider extracting it to a shared utility to avoid divergence.
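A sketch of what such a shared utility could look like, assuming the width rules described in this review (fp16/bf16 → 2, fp32 on SM≥90 → 4); the helper name is hypothetical:

```cpp
// Hypothetical shared helper; both AtomicAddNode::GetVectorizeLength and
// AtomicAddVectorizeRewriter::GetMaxVectorSize could delegate to it.
int GetAtomicAddVectorSize(DataType dtype, const Target& target) {
  if (dtype == DataType::Float(16) || dtype == DataType::BFloat(16)) {
    return 2;  // vectorized half2 / bfloat162 atomic add
  }
  if (dtype == DataType::Float(32) && TargetHasSMVersionGE(target, 90)) {
    return 4;  // float4 atomic add on SM90+
  }
  return 1;  // scalar fallback
}
```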
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (16)
src/op/atomic_add.cc
src/op/atomic_add.h
src/op/atomic_reduce.cc
src/op/atomic_reduce.h
src/op/builtin.cc
src/op/builtin.h
src/op/parallel.cc
src/target/codegen_cuda.cc
src/target/utils.cc
src/target/utils.h
src/transform/atomicadd_vectorize.cc
src/transform/atomicadd_vectorize.h
src/transform/loop_partition.cc
src/transform/loop_vectorize.cc
testing/python/language/test_tilelang_language_atomic.py
tilelang/language/atomic.py
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2026-01-06T05:20:51.649Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1606
File: testing/python/transform/test_tilelang_transform_hoist_broadcast_values.py:30-30
Timestamp: 2026-01-06T05:20:51.649Z
Learning: In `testing/python/transform/test_tilelang_transform_hoist_broadcast_values.py`, the test validates that the `hoist_broadcast_values` transformation pass correctly identifies and hoists broadcast operations by checking for patterns in the generated kernel source code. The specific literal values used (e.g., 430) are not important for the test's purpose, as it does not validate numerical precision or actual stored tensor values.
Applied to files:
testing/python/language/test_tilelang_language_atomic.py
📚 Learning: 2025-12-18T04:50:00.512Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1464
File: testing/python/language/test_tilelang_language_rand.py:14-14
Timestamp: 2025-12-18T04:50:00.512Z
Learning: In `testing/python/language/test_tilelang_language_rand.py`, the TileLang kernel uses `blk_M = M` (single block) and calls `rng_rand()` four times per element to align results with the Triton implementation, which uses `blk_M = 128` (multiple blocks) and calls the RNG once per element. These differences compensate for internal RNG behavior differences between TileLang and Triton.
Applied to files:
testing/python/language/test_tilelang_language_atomic.py
📚 Learning: 2025-11-14T07:56:11.098Z
Learnt from: lucifer1004
Repo: tile-ai/tilelang PR: 1256
File: testing/python/jit/test_tilelang_jit_gemm_nvrtc.py:55-115
Timestamp: 2025-11-14T07:56:11.098Z
Learning: In `testing/python/jit/test_tilelang_jit_gemm_nvrtc.py`, the global function `tilelang_callback_cuda_postproc` registered via `tvm.register_global_func(..., override=True)` is intentionally not restored after the test completes, as the persistent behavior is expected.
Applied to files:
testing/python/language/test_tilelang_language_atomic.py
🧬 Code graph analysis (8)

src/target/utils.h (1)
- src/target/utils.cc (2): TargetHasSMVersionGE (130-135), TargetHasSMVersionGE (130-130)

src/transform/loop_partition.cc (1)
- src/transform/atomicadd_vectorize.cc (2): VectorizeAtomicAdd (157-160), VectorizeAtomicAdd (157-157)

src/target/utils.cc (2)
- src/op/gemm.cc (2): GetArchInt (413-424), GetArchInt (413-413)
- src/op/gemm_py.cc (2): GetArchInt (230-241), GetArchInt (230-230)

src/transform/atomicadd_vectorize.h (1)
- src/transform/atomicadd_vectorize.cc (2): VectorizeAtomicAdd (157-160), VectorizeAtomicAdd (157-157)

src/op/atomic_add.cc (3)
- src/op/atomic_reduce.cc (4): GetElemOpStatic (52-52), GetElemOpStatic (52-52), GetElemOpStatic (81-81), GetElemOpStatic (81-81)
- src/transform/atomicadd_vectorize.cc (2): dtype (64-73), dtype (64-64)
- src/target/utils.cc (2): TargetHasSMVersionGE (130-135), TargetHasSMVersionGE (130-130)

src/op/parallel.cc (1)
- src/op/utils.h (1): IsGlobalBuffer (55-57)

tilelang/language/atomic.py (5)
- tilelang/language/copy_op.py (1): get_extent (64-77)
- tilelang/language/frame.py (2): has_let_value (188-197), get_let_value (200-209)
- tilelang/language/utils.py (1): region (7-10)
- tilelang/language/tir/op.py (1): call_intrin (120-145)
- tilelang/utils/language.py (2): legalize_pairwise_extents (406-449), to_buffer_region (194-237)

testing/python/language/test_tilelang_language_atomic.py (2)
- tilelang/transform/pass_config.py (1): PassConfigKey (6-163)
- tilelang/language/atomic.py (2): atomic_max (20-104), atomic_min (107-191)
🪛 Cppcheck (2.19.0)
src/op/atomic_reduce.cc
[error] 281-281: syntax error
(syntaxError)
🪛 Ruff (0.14.11)
tilelang/language/atomic.py
337-337: Useless if-else condition
(RUF034)
375-375: Useless if-else condition
(RUF034)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Quick Lint
🔇 Additional comments (35)
src/op/builtin.h (1)
588-658: LGTM! Well-organized atomic operation declarations.

The new atomic operation declarations follow a consistent naming convention and are well documented. The separation between regular operations (atomic_*_elem_op) and return-value variants (atomic_*_ret_elem_op) is clean.

Note: The vectorized variants (atomic_addx2_elem_op, atomic_addx4_elem_op) don't have corresponding _ret_ versions. This is consistent with the code in atomic.py, but as noted in that file's review, the return_prev parameter in atomic_addx2/atomic_addx4 should either be implemented or raise NotImplementedError.

src/transform/atomicadd_vectorize.h (1)
19-29: LGTM! Clean API simplification.

The simplified interface is much cleaner than the previous planner-based approach. The function now relies on Target::Current() internally (as shown in the implementation at src/transform/atomicadd_vectorize.cc lines 156-159), which is appropriate for this use case.

src/op/atomic_reduce.h (2)
26-71: Well-designed CRTP base class for atomic operations.

The AtomicOpBaseNode<Derived> template provides a clean foundation for atomic operations with:

- Common buffer/range management
- Shared lowering and layout-inference hooks
- Memory-order extraction from annotations
- CRTP dispatch of the element-wise operation via GetElemOp()

The mutable qualifier on par_op_ is appropriate for caching during const operations.

77-139: LGTM! Consistent AtomicMax and AtomicMin node implementations.

Both AtomicMaxNode and AtomicMinNode follow the same structure with:

- FFI reflection declarations
- Static Get() and GetElemOpStatic() accessors returning atomic_max_elem_op() and atomic_min_elem_op() respectively
- A Clone() implementation
- RegisterReflection() for property access

src/target/utils.h (1)
32-32: LGTM! Useful utility for SM version checks.

The TargetHasSMVersionGE function provides a clean way to check CUDA SM version compatibility. The implementation correctly handles non-CUDA targets by returning false.

src/transform/loop_partition.cc (2)
32-32: LGTM!

The new include for atomic add vectorization is appropriately placed with other transform headers.

300-302: LGTM!

The atomic add vectorization step is correctly positioned after general loop vectorization (Step 2) and before predicate wrapping (Step 4). This ordering ensures that vectorized atomic operations benefit from the loop-vectorization pass while the predicate guard is still applied correctly to the final transformed loop.
src/target/utils.cc (1)
130-135: LGTM!The
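For reference, a minimal sketch of what such a check looks like (assumed signature; the actual implementation is in src/target/utils.cc lines 130-135):

```cpp
// Sketch: false for non-CUDA targets, otherwise compare the SM architecture
// number (e.g. 90 for sm_90) against the requested minimum.
bool TargetHasSMVersionGE(Target target, int min_sm) {
  if (!target.defined() || target->kind->name != "cuda") {
    return false;  // SM versions only apply to CUDA targets
  }
  return GetArchInt(target) >= min_sm;
}
```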
TargetHasSMVersionGEfunction follows the established pattern for target capability checks in this file. The implementation correctly handles non-CUDA targets and delegates toGetArchIntfor architecture comparison.src/op/parallel.cc (2)
12-12: LGTM!

The include is appropriately added for arithmetic helper utilities.

457-461: LGTM!

The refactoring from string-based scope checks to helper predicates (IsSharedBuffer, IsGlobalBuffer) improves code maintainability and consistency with similar changes in loop_vectorize.cc. The logic remains functionally equivalent while being more readable.

Also applies to: 476-476
src/transform/loop_vectorize.cc (2)
27-27: LGTM!

The include for ../op/utils.h is needed for the IsSharedBuffer and IsGlobalBuffer helper predicates.

122-123: LGTM!

Consistent refactoring to use helper predicates instead of inline string comparisons for buffer scope checks.
Also applies to: 137-138
src/target/codegen_cuda.cc (5)
2891-2900: LGTM!

The atomic_add_elem_op codegen correctly emits AtomicAdd(dst, src[, memory_order]) with an optional memory-order parameter. The statement-context pattern (PrintIndent + stream) is appropriate.

2901-2909: LGTM!

The atomic_add_ret_elem_op correctly outputs to os (expression context) since it returns the previous value, matching the pattern for other return-value operations.

2910-2929: LGTM!

The vectorized atomic add variants (atomic_addx2_elem_op, atomic_addx4_elem_op) correctly emit AtomicAddx2/AtomicAddx4 calls with consistent handling of the optional memory-order parameter.

2930-2941: LGTM!

The atomic_load_elem_op and atomic_store_elem_op handlers are correctly implemented:

- atomic_load outputs to os (returns the loaded value)
- atomic_store uses PrintIndent + stream (statement context)
- Both require the memory_order parameter, as expected for explicit memory-ordering operations.

2942-2979: LGTM!

The atomic_max_elem_op, atomic_max_ret_elem_op, atomic_min_elem_op, and atomic_min_ret_elem_op handlers follow the same consistent patterns as the atomic add operations:

- Non-return variants use statement context (PrintIndent + stream)
- Return variants use expression context (os)
- Memory order is optional for all max/min operations
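To make the statement-vs-expression distinction concrete, here is a standalone mock of the two emission contexts (illustrative only; the real handlers in src/target/codegen_cuda.cc write into the codegen's stream/os):

```cpp
#include <iostream>
#include <string>
#include <vector>

// Join a call expression such as "AtomicAdd(dst, src, order)".
std::string EmitCall(const std::string& name, const std::vector<std::string>& args) {
  std::string out = name + "(";
  for (size_t i = 0; i < args.size(); ++i) out += (i ? ", " : "") + args[i];
  return out + ")";
}

int main() {
  // Statement context: the non-return variant becomes an indented statement;
  // the trailing memory_order argument is optional.
  std::cout << "  " << EmitCall("AtomicAdd", {"dst", "src"}) << ";\n";
  // Expression context: the return variant is embedded in a larger expression.
  std::cout << "  prev = " << EmitCall("AtomicAddRet", {"dst", "src", "order"}) << ";\n";
  return 0;
}
```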
testing/python/language/test_tilelang_language_atomic.py (3)
7-12: LGTM! Pass configs correctly disable optimizations.

The explicit disabling of TMA lowering and warp specialization for the scalar atomic_add_program test ensures deterministic behavior and isolates the core atomic_add functionality being tested.

395-429: LGTM! Tile-level atomic max test correctly implemented.

The test properly initializes B with float("-inf") for max reduction, and the reference program correctly computes the element-wise maximum across the K dimension. The kernel structure is consistent with the existing tile_atomic_add_program pattern.

432-466: LGTM! Tile-level atomic min test correctly implemented.

The test properly initializes B with float("inf") for min reduction, and the reference program correctly computes the element-wise minimum across the K dimension. The implementation is symmetric with tile_atomic_max and follows established patterns.

src/transform/atomicadd_vectorize.cc (3)
22-34: LGTM! Buffer load extraction helper is well implemented.

The helper correctly handles both direct BufferLoad expressions and those wrapped in address_of calls, returning an Optional for clean error handling.

39-48: LGTM! Vector size to op mapping is correct.

The mapping correctly handles x4 (SM>=90 float32), x2 (float16/bfloat16), and falls back to scalar for unsupported cases. The default case provides a safe fallback.
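A sketch of that mapping, using the TL builtins named in this review (structure assumed; the real helper is at lines 39-48 of atomicadd_vectorize.cc):

```cpp
// Map the chosen vector width to the corresponding TL builtin, falling back
// to the scalar op for unsupported widths.
const Op& GetVectorizedAtomicOpSketch(int vector_size) {
  switch (vector_size) {
    case 4: return tl::atomic_addx4_elem_op();  // float32 on SM >= 90
    case 2: return tl::atomic_addx2_elem_op();  // float16 / bfloat16
    default: return tl::atomic_add_elem_op();   // safe scalar fallback
  }
}
```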
157-160: LGTM! Entry point handles undefined target gracefully.

Target::Current(false) returns an undefined target if no context is set, which is safe since TargetHasSMVersionGE returns false for non-CUDA targets (per src/target/utils.cc), disabling SM>=90 float32 vectorization in that case.

src/op/atomic_add.cc (3)
77-77: LGTM! Static element op accessor follows established pattern.

This is consistent with AtomicMaxNode::GetElemOpStatic() in src/op/atomic_reduce.cc (line 51 from relevant snippets).

561-583: LGTM! Non-TMA lowering pipeline is well structured.

The pipeline correctly creates a SIMT loop, fuses parallel loops, runs multi-level layout inference, and lowers with proper predicate handling via ParallelOp. The three-level inference (kCommon, kStrict, kFree) ensures robust layout selection.

146-149: Scalar case intentionally uses non-atomic BufferStore for single-threaded execution.

The scalar case (when loop_vars is empty) returns a For loop with ForKind::kSerial and extent 1, which guarantees single-threaded execution. This is intentional design: the comment on line 141 explicitly states "For scalar cases a serial For of extent 1 is returned." When only one thread can execute the code, atomic semantics are unnecessary. This pattern is used consistently across atomic_add.cc, atomic_reduce.cc, and copy.cc for all scalar operations.

Likely an incorrect or invalid review comment.
src/op/builtin.cc (1)
354-402: LGTM! Atomic operation builtins are correctly defined.

The new atomic builtins follow consistent naming conventions (atomic_<op>_elem_op, with atomic_<op>_ret_elem_op for return-previous variants) and correctly use the kOpaque effect kind for operations with memory side effects. Input counts are appropriate:
- 3 inputs for add/max/min (dst, src, memory_order)
- 2 inputs for load (src, memory_order)
- 3 inputs for store (dst, value, memory_order)
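As an illustration, registering one such builtin with TVM's op registry generally looks like this (sketch based on TVM's standard registration pattern; the actual definitions are in src/op/builtin.cc):

```cpp
// Three inputs (dst, src, memory_order); kOpaque because the call has memory
// side effects that must not be reordered or eliminated.
TVM_REGISTER_OP("tl.atomic_max_elem_op")
    .set_num_inputs(3)
    .set_attr<TCallEffectKind>("TCallEffectKind",
                               Integer(CallEffectKind::kOpaque));
```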
src/op/atomic_add.h (1)
16-71: LGTM! AtomicAddNode class design is clean and well documented.

The CRTP inheritance from AtomicOpBaseNode<AtomicAddNode> enables code sharing with AtomicMaxNode/AtomicMinNode. The public API is minimal, protected helpers are appropriately scoped, and the brief doc comments clarify each method's purpose.

src/op/atomic_reduce.cc (7)
29-81: LGTM!

The AtomicMax and AtomicMin implementations follow a consistent pattern with proper buffer-region handling, clone support for parallel operations, and appropriate element-op accessors. The structural similarity between the two is acceptable given that the CRTP base class handles the complex shared logic.

102-121: LGTM!

The index construction correctly handles unit-extent dimensions by using the range minimum, while properly incorporating iteration variables for non-trivial dimensions. The ICHECK at the end provides useful diagnostic information when there's a mismatch.

123-153: LGTM!

The predicate generation correctly handles bounds checking with symbolic analysis to eliminate provably-true conditions. Returning an empty PrimExpr for the no-predicate case follows TVM conventions.
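A rough sketch of that CanProve-based pruning (hypothetical helper shape; see MakePredicate in src/op/atomic_reduce.cc for the real logic):

```cpp
// For each index, require "index < extent" unless the analyzer can already
// prove it; an undefined result means no runtime guard is needed.
PrimExpr MakeBoundsPredicateSketch(arith::Analyzer* analyzer,
                                   const Array<PrimExpr>& indices,
                                   const Array<PrimExpr>& shape) {
  PrimExpr cond;
  for (size_t i = 0; i < indices.size(); ++i) {
    PrimExpr in_bounds = indices[i] < shape[i];
    if (analyzer->CanProve(in_bounds)) continue;  // provably true: drop it
    cond = cond.defined() ? (cond && in_bounds) : in_bounds;
  }
  return cond;
}
```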
164-214: LGTM!

The non-scalar path correctly:

- Binds iteration variables for analysis
- Constructs proper indices via MakeIndices
- Handles dtype casting between src and dst
- Uses CRTP to retrieve the memory order and element op from the derived type
- Builds nested parallel loops with appropriate annotations

216-233: LGTM!

The layout inference correctly enforces that source and destination fragments must have matching layouts for atomic reduce operations, which is a necessary constraint for correctness. Returning an empty LayoutMap is appropriate since these operations don't introduce new layout requirements.

235-260: LGTM!

The lowering implementation follows the established pattern: SIMT loop generation → fusion → parallel op construction → multi-level layout inference → final lowering with predicates. The three-level layout inference (kCommon, kStrict, kFree) provides robust layout handling.

262-284: LGTM!

Template instantiations are properly declared for all three atomic operation types. The operator registrations correctly specify:

- 2 inputs (src and dst regions)
- kOpaque call effect kind, appropriate for operations with side effects

Regarding the static analysis hint at line 281: this is a false positive. TVM_FFI_STATIC_INIT_BLOCK() is a TVM macro that expands to a valid static initialization construct; the analyzer doesn't parse the macro expansion.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
…ions

- Updated function signatures in atomic add, atomic reduce, copy, and fill operations to include layout map parameters for improved layout handling.
- Refactored vectorization logic to utilize layout maps, ensuring better performance and compatibility with various buffer layouts.
- Enhanced the LowerParallelLoop function to accept layout maps, facilitating more efficient loop transformations.
- Added checks for buffer contiguity in vectorization processes to ensure correctness when using layout maps.
- Updated tests to validate the new layout handling and vectorization behavior.
- Introduced a new method to identify atomic operations within the legalize_safe_memory_access transformation.
- Updated the VisitStmt_ function to handle both CallExtern and atomic operations, ensuring recursive condition collection for these cases.
- Enhanced comments for clarity on the handling of atomic operations in the context of memory access legality.
- Implemented support for various atomic operations including atomic add, atomic load, atomic store, atomic max, and atomic min in the HIP code generation.
- Enhanced the handling of atomic operations to include optional memory order parameters.
- Improved code readability with added comments explaining the purpose of each atomic operation.
- Ensured consistency with existing atomic operation implementations in the codebase.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@src/op/atomic_reduce.cc`:
- Around line 30-43: The AtomicMax::AtomicMax constructor assumes args has at
least two elements and dereferences args[0]/args[1] without checks; add a
precondition check (e.g., DCHECK/ICHECK/CHECK_GE or throw a descriptive
std::runtime_error) verifying args.size() >= 2 at the start of
AtomicMax::AtomicMax, and return/raise an error with a clear message if the
condition fails; keep the rest of the logic (calling NormalizeToBufferRegion,
populating rgs/bf, and assigning node fields) unchanged.
In `@src/target/codegen_hip.cc`:
- Around line 997-1035: The codegen emits calls to missing HIP template helpers:
add templated functions AtomicLoad(src_ptr, memory_order), AtomicStore(dst_ptr,
value, memory_order), AtomicMaxRet(address, val, memory_order), and
AtomicMinRet(address, val, memory_order) into the HIP atomic template header
(matching the style and naming of existing
AtomicAdd/AtomicAddRet/AtomicMax/AtomicMin/AtomicAddx2* functions); ensure
AtomicLoad returns the loaded value, AtomicStore is void, and
AtomicMaxRet/AtomicMinRet return the previous value, accept the same
pointer/value types as AtomicMax/AtomicMin, accept an optional memory_order
parameter, and implement them using the same atomic primitive patterns (or
__atomic builtins) and template signatures as the existing Atomic* helpers so
generated code from atomic_load_elem_op, atomic_store_elem_op,
atomic_max_ret_elem_op, and atomic_min_ret_elem_op will compile.
♻️ Duplicate comments (1)
src/op/atomic_reduce.cc (1)
159-163: Scalar case still uses incorrect hardcoded indices.

This issue was flagged in a previous review: when loop_vars is empty but the buffer is multi-dimensional (e.g., shape (1, 1, 1)), using {0} provides only one index instead of the required three. The fix should use MakeIndices to construct proper indices for all dimensions.
🧹 Nitpick comments (2)
src/op/atomic_reduce.cc (1)
59-72: Same missing bounds check; consider extracting shared constructor logic.

This constructor duplicates lines 30-43 verbatim. Consider extracting a helper (e.g., InitAtomicOpNode) to reduce duplication and add the bounds check once.

♻️ Sketch of helper extraction

```cpp
// Helper to initialize common atomic op node fields
template <typename NodeType>
void InitAtomicOpNode(ObjectPtr<NodeType>& node, const Array<PrimExpr>& args,
                      const Map<String, ObjectRef>& annotations) {
  ICHECK(args.size() >= 2) << "Atomic op requires at least 2 arguments";
  Array<Range> rgs[2];
  Buffer bf[2];
  for (int i = 0; i < 2; i++) {
    auto region = NormalizeToBufferRegion(args[i]);
    rgs[i] = region->region;
    bf[i] = region->buffer;
  }
  std::tie(node->src, node->dst) = std::tie(bf[0], bf[1]);
  std::tie(node->src_range, node->dst_range) = std::tie(rgs[0], rgs[1]);
  node->annotations = annotations;
}
```

src/target/codegen_hip.cc (1)
1036-1047: Consider adding argument-count validation for load/store operations.

Unlike other atomic operations that check op->args.size() > 2 for the optional memory_order, atomic_load_elem_op and atomic_store_elem_op directly access indices without validation. Adding ICHECK assertions would improve debuggability if malformed IR is ever passed.

Suggested defensive checks

```diff
 } else if (op->op.same_as(tl::atomic_load_elem_op())) {
   // atomic_load_elem_op(src_ptr, memory_order) -> returns loaded value
+  ICHECK_EQ(op->args.size(), 2U)
+      << "atomic_load_elem_op expects 2 arguments (src_ptr, memory_order)";
   os << "AtomicLoad(" << PrintExpr(op->args[0]) << ", "
      << PrintExpr(op->args[1]) << ")";
 } else if (op->op.same_as(tl::atomic_store_elem_op())) {
   // atomic_store_elem_op(dst_ptr, value, memory_order)
+  ICHECK_EQ(op->args.size(), 3U)
+      << "atomic_store_elem_op expects 3 arguments (dst_ptr, value, memory_order)";
   std::string dst_ptr = PrintExpr(op->args[0]);
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/op/atomic_reduce.cc
src/target/codegen_hip.cc
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2026-01-12T07:25:35.591Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1631
File: src/transform/thread_storage_sync.cc:1126-1137
Timestamp: 2026-01-12T07:25:35.591Z
Learning: In TileLang's thread storage synchronization pass (src/transform/thread_storage_sync.cc), at the IR level where PointerAccessIsDisjoint is called, the threads array in AccessEntry is guaranteed to contain all three thread dimensions (threadIdx.x, threadIdx.y, threadIdx.z), making access to the last 3 elements via `threads[threads.size() + idx - 3]` safe.
Applied to files:
src/op/atomic_reduce.cc
🧬 Code graph analysis (1)
src/op/atomic_reduce.cc (1)
src/op/atomic_reduce.h (1)
- AtomicMax (99-107)
🪛 Cppcheck (2.19.0)
src/op/atomic_reduce.cc
[error] 282-282: syntax error
(syntaxError)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
- GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
🔇 Additional comments (8)
src/op/atomic_reduce.cc (7)
88-101: LGTM - common TVM pattern for loop variable naming.

The char('i' + idx) naming scheme is a standard TVM convention. While it technically produces non-alphabetic names beyond 17 non-unit dimensions, this is practically unreachable for atomic operations.
103-154: LGTM - index and predicate construction is sound.

The iteration-variable tracking with idx and the ICHECK assertion ensure correct index construction. The predicate optimization using analyzer->CanProve is good practice.
178-214: LGTM - non-scalar SIMT loop generation is well structured.

The atomic call construction with proper dtype casting, pointer generation via address_of, and memory ordering via the CRTP GetMemoryOrder() follows the expected pattern. The nested parallel loop construction with annotation propagation is correct.
217-234: InferLayout returns an empty map - verify this is intentional.

The method validates that src and dst have matching layouts when both are fragments, but always returns an empty LayoutMap. This means the operator doesn't contribute any layout constraints back to the inference system. If this is intentional (relying on other operators to establish layouts), this is fine; otherwise, it may cause layout mismatches.
236-261: LGTM - lowering follows the standard ParallelOp pattern.

The three-level layout inference (kCommon → kStrict → kFree) and the subsequent LowerParallelLoop integration are consistent with other tile operators in the codebase.
272-285: LGTM - operator registration and reflection initialization.

The tile operator registration with the kOpaque call effect kind is appropriate for atomic operations. The static analysis "syntax error" at line 282 is a false positive; TVM_FFI_STATIC_INIT_BLOCK is a valid TVM macro for static initialization.
263-266: No ODR violation exists.

The explicit instantiation of AtomicOpBaseNode<AtomicAddNode> appears only in atomic_reduce.cc:266. The atomic_add.cc file does not contain a duplicate instantiation; it only provides method implementations, operator registration, and FFI initialization. This is the correct pattern and poses no ODR risk.

Likely an incorrect or invalid review comment.
src/target/codegen_hip.cc (1)
1048-1085: LGTM!

The atomic max/min implementations follow the same established pattern as the atomic add variants. The optional memory_order handling is consistent, and the statement-vs-expression form handling is correct.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
- Changed the GetElemOpStatic method to a virtual GetElemOp method in AtomicOpBaseNode, enhancing polymorphism.
- Updated AtomicAddNode, AtomicMaxNode, and AtomicMinNode to override the new GetElemOp method.
- Removed unnecessary template parameters from AtomicOpBaseNode, simplifying the class structure.
- Cleaned up includes in atomic_reduce.cc to remove unused dependencies, improving code organization.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/op/atomic_add.cc (2)
146-148: Scalar case uses a hardcoded single-element index array.

When loop_vars is empty (all dimensions have extent 1), the buffer may still be multi-dimensional (e.g., shape (1, 1, 1)). Using {0} provides only one index. Consider using MakeIndices to generate proper indices for all dimensions.

🐛 Proposed fix

```diff
   if (is_scalar) {
+    Array<PrimExpr> src_indices = MakeIndices(loop_vars, 0);
+    Array<PrimExpr> dst_indices = MakeIndices(loop_vars, 1);
     return For(Var("i"), 0, 1, ForKind::kSerial,
-               BufferStore(dst, BufferLoad(src, {0}), {0}));
+               BufferStore(dst, BufferLoad(src, src_indices), dst_indices));
   }
```
169-171: Predicates computed but never used.

src_predicate and dst_predicate are computed but not used in the atomic call or loop body. If bounds checking is needed, wrap the atomic call in an IfThenElse. Otherwise, remove these unused computations.

Option 1: Remove unused predicates

```diff
-  // Optional bounds predicates for src and dst
-  PrimExpr src_predicate = MakePredicate(analyzer, loop_vars, src->shape, 0);
-  PrimExpr dst_predicate = MakePredicate(analyzer, loop_vars, dst->shape, 1);
-
   // Load source value and cast to dst dtype if needed
```

Option 2: Use predicates for bounds checking

```diff
   Stmt body = tvm::tir::Evaluate(atomicadd_call);
+
+  // Guard with bounds predicate if needed
+  PrimExpr combined_predicate;
+  if (src_predicate.defined() && dst_predicate.defined()) {
+    combined_predicate = And(src_predicate, dst_predicate);
+  } else if (src_predicate.defined()) {
+    combined_predicate = src_predicate;
+  } else if (dst_predicate.defined()) {
+    combined_predicate = dst_predicate;
+  }
+  if (combined_predicate.defined()) {
+    body = IfThenElse(combined_predicate, body);
+  }
```
🤖 Fix all issues with AI agents
In `@src/op/atomic_reduce.cc`:
- Around line 58-71: The AtomicMin constructor accesses args[0] and args[1]
without checking args.size(); add a bounds check at the start of
AtomicMin::AtomicMin (e.g., if (args.size() < 2) throw/assert with a clear
message) before calling NormalizeToBufferRegion, so you validate the input and
avoid out-of-bounds access when constructing the AtomicMinNode and assigning
node->src/node->dst and ranges.
♻️ Duplicate comments (4)
src/op/atomic_add.h (1)
69-70: Unused method declaration LowerTMA.

LowerTMA is declared but never defined or called. The TMA lowering logic in atomic_add.cc is directly embedded in Lower() (lines 348-560). Either remove this declaration or extract the TMA logic into this method for consistency.

Proposed fix (if not extracting TMA logic)

```diff
   /// Compute linear layout for shared tensor (used in TMA atomic add)
   Layout ComputeLinearLayout(const Buffer &shared_tensor) const;
-
-  /// Lower TMA-based atomic add
-  Stmt LowerTMA(const LowerArgs &T, arith::Analyzer *analyzer) const;
 };
```

src/op/atomic_reduce.h (1)
43-50: Apply the suggested fix for the template keyword in the as<> call.

Per the past review comment, this should use val->template as<IntImmNode>() to fix the build error on some compilers.

🐛 Proposed fix

```diff
 int GetMemoryOrder() const {
   if (auto val = annotations.Get("memory_order")) {
-    if (auto int_val = val->as<IntImmNode>()) {
+    if (auto int_val = val->template as<IntImmNode>()) {
       return int_val->value;
     }
   }
   return 0;
 }
```

src/op/atomic_reduce.cc (2)
29-42: Missing bounds check on args array access.

The constructor accesses args[0] and args[1] without verifying that args.size() >= 2. This will cause undefined behavior if called with fewer arguments.

🐛 Proposed fix

```diff
 AtomicMax::AtomicMax(Array<PrimExpr> args, Map<String, ObjectRef> annotations) {
+  ICHECK(args.size() >= 2) << "AtomicMax requires at least 2 arguments, got "
+                           << args.size();
   ObjectPtr<AtomicMaxNode> node = tvm::ffi::make_object<AtomicMaxNode>();
```
154-157: Incorrect index handling for the scalar case with multi-dimensional buffers.

When all dimensions have extent 1, loop_vars is empty but the buffer may still be multi-dimensional (e.g., shape (1, 1, 1)). Using a hardcoded {0} provides only one index when multiple may be required.

The MakeIndices function correctly handles this by using ranges[i]->min for unit-extent dimensions, but it is bypassed here.

🐛 Proposed fix

```diff
   if (is_scalar) {
+    // Use MakeIndices to construct proper indices for all dimensions
+    Array<PrimExpr> src_indices = MakeIndices(loop_vars, 0);
+    Array<PrimExpr> dst_indices = MakeIndices(loop_vars, 1);
     return For(Var("i"), 0, 1, ForKind::kSerial,
-               BufferStore(dst, BufferLoad(src, {0}), {0}));
+               BufferStore(dst, BufferLoad(src, src_indices), dst_indices));
   }
```
🧹 Nitpick comments (5)
src/op/atomic_reduce.h (2)
15-15: Avoid using namespace in header files.

using namespace tir; in a header pollutes the namespace for all translation units that include this file, potentially causing name collisions and making code harder to maintain.

Consider removing this directive and qualifying names explicitly (e.g., tir::Buffer, tir::Range), or confining it to a .cc file.
34-34: Consider making par_op_ protected or private.

par_op_ is a mutable member used for internal state during lowering. Exposing it publicly allows external code to modify it unexpectedly. Consider moving it to protected to match the other internal helpers.

src/op/atomic_add.h (1)
56-57: Document vectorization length assumptions.

Consider adding a brief doc comment explaining the returned values (2 for fp16/bf16, 4 for fp32 on SM>=90, 1 otherwise) directly in the header for quick reference, rather than only in the implementation.

src/op/atomic_reduce.cc (2)
src/op/atomic_reduce.cc (2)
87-99: Variable naming may overflow for high-dimensional tensors.

char('i' + idx) will produce non-alphabetic characters for tensors with more than 17 dimensions (past 'z'). While unlikely in practice, consider a more robust naming scheme.

♻️ Proposed improvement

```diff
-  Var var = Var(std::string{char('i' + idx)}, src_range[i]->extent->dtype);
+  Var var = Var("i" + std::to_string(idx), src_range[i]->extent->dtype);
```
211-227: Consider propagating fragment layout when only one buffer has a layout inferred.

Currently, InferLayout only validates when both src and dst have layouts already present in the map. For consistency with how other fragment-based operators like ReduceOp and CumSumOp handle layout inference, consider propagating the layout from one buffer to the other when only one has a layout defined. This would make layout inference more robust when upstream operations only partially populate the layout map.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
src/op/atomic_add.cc
src/op/atomic_add.h
src/op/atomic_reduce.cc
src/op/atomic_reduce.h
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2026-01-12T07:25:35.591Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1631
File: src/transform/thread_storage_sync.cc:1126-1137
Timestamp: 2026-01-12T07:25:35.591Z
Learning: In TileLang's thread storage synchronization pass (src/transform/thread_storage_sync.cc), at the IR level where PointerAccessIsDisjoint is called, the threads array in AccessEntry is guaranteed to contain all three thread dimensions (threadIdx.x, threadIdx.y, threadIdx.z), making access to the last 3 elements via `threads[threads.size() + idx - 3]` safe.
Applied to files:
src/op/atomic_reduce.cc
🧬 Code graph analysis (3)
src/op/atomic_reduce.h (4)
- src/op/atomic_add.h (2): tl (12-84), GetElemOp (33-44)
- src/op/builtin.h (1): tl (22-717)
- src/op/operator.h (2): TileOperatorNode (75-85), TileOperator (87-91)
- src/transform/common/loop_vectorization_utils.h (1): if (443-445)

src/op/atomic_add.h (2)
- src/op/atomic_reduce.h (4): tl (13-135), AtomicOpBaseNode (24-134), GetElemOp (77-88), GetElemOp (109-120)
- src/op/atomic_add.cc (12): Lower (346-583), Lower (346-346), InferLayout (253-309), InferLayout (253-254), GetElemOp (77-77), GetElemOp (77-77), GetVectorizeLength (90-100), GetVectorizeLength (90-90), MakeSIMTLoop (143-207), MakeSIMTLoop (143-143), ComputeLinearLayout (219-234), ComputeLinearLayout (219-219)

src/op/atomic_reduce.cc (2)
- src/op/atomic_reduce.h (5): AtomicMax (92-100), GetElemOp (77-88), GetElemOp (109-120), AtomicMin (124-132), GetMemoryOrder (43-89)
- src/op/atomic_add.cc (6): Clone (69-75), Clone (69-69), GetElemOp (77-77), GetElemOp (77-77), MakeSIMTLoop (143-207), MakeSIMTLoop (143-143)
🪛 Cppcheck (2.19.0)
src/op/atomic_reduce.cc
[error] 269-269: syntax error
(syntaxError)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
- GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
- GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
🔇 Additional comments (8)
src/op/atomic_reduce.h (2)
91-100: LGTM on wrapper class design.

The AtomicMax and AtomicMin wrapper classes follow TVM's standard pattern with TVM_FFI_DEFINE_OBJECT_REF_METHODS_NULLABLE, constructors accepting args and annotations, and static Get() accessors. This is consistent with AtomicAdd.

Also applies to: 123-132
71-74: No action needed.

The third argument to TVM_FFI_DECLARE_OBJECT_INFO_FINAL is correct. All TileOperatorNode-derived classes throughout the codebase, including AtomicAddNode, CopyNode, GemmNode, and others, consistently use TileOperatorNode as the third argument regardless of whether they have an intermediate base class. This is the correct FFI type-hierarchy registration for TVM's object system, where the macro declares the root FFI type rather than the direct parent class.

Likely an incorrect or invalid review comment.
src/op/atomic_add.h (1)
9-9: LGTM on inheritance refactor.

The change to inherit from AtomicOpBaseNode properly consolidates common atomic operation functionality. The class documentation clearly describes the TMA support and vectorization extensions specific to AtomicAdd.

Also applies to: 16-21
src/op/atomic_add.cc (3)
373-374: global_coords may be unused in some code paths.

global_coords is computed, but the variable appears to be used only inside the loop when (*inner_box_dim) != instruction_dim (line 532) and in the else branch (line 549). Verify this is intentional.
561-582: LGTM on ParallelOp-based lowering.

The non-TMA lowering path correctly:

- Builds a SIMT loop via MakeSIMTLoop
- Fuses parallel loops with ParallelLoopFuser::Fuse
- Runs layout inference at multiple levels (kCommon, kStrict, kFree)
- Lowers with LowerParallelLoop, including predicate handling

This aligns with the broader PR's architectural shift to ParallelOp-based lowering.
90-100: LGTM on vectorization length logic.

The implementation correctly returns:
- 2 for float16/bfloat16 (atomicAdd.x2)
- 4 for float32 on SM >= 90 (atomicAdd.x4)
- 1 for all other cases
This aligns with CUDA's vectorized atomic add capabilities.
src/op/atomic_reduce.cc (2)
229-253: LGTM on the base class Lower implementation.

AtomicOpBaseNode::Lower correctly implements the ParallelOp-based lowering pipeline:

- Creates a SIMT loop via MakeSIMTLoop
- Fuses parallel loops
- Runs layout inference at all levels
- Lowers with LowerParallelLoop

This provides a clean default implementation that AtomicMax and AtomicMin inherit directly.
259-272: LGTM on operator registration.

The TIR_REGISTER_TL_TILE_OP registrations correctly set up atomicmax and atomicmin with 2 inputs and opaque call effect. The reflection registration in TVM_FFI_STATIC_INIT_BLOCK follows the established pattern.

Note: The static analysis "syntax error" at line 269 appears to be a false positive; the TVM_FFI_STATIC_INIT_BLOCK() macro is a valid TVM construct.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
- Added checks in AtomicAdd, AtomicMax, and AtomicMin constructors to ensure at least two arguments are provided, improving error handling and user feedback.
- Removed the unused LowerTMA method declaration from the AtomicAdd class, streamlining the codebase.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/op/copy.cc`:
- Around line 727-728: The return uses an undefined variable `transformed_loop`;
replace it with the fused loop variable `fused_loop` so the call becomes
VectorizeLoop(fused_loop, T.layout_map) and return the resulting
`vectorized_thread_loop` (i.e., ensure VectorizeLoop is invoked with
`fused_loop` instead of `transformed_loop` in the code path that computes and
returns `vectorized_thread_loop`).
♻️ Duplicate comments (1)
src/op/atomic_reduce.cc (1)
157-163: Scalar case still uses incorrect hardcoded indices for multi-dimensional buffers.

This issue was previously flagged. When all dimensions have unit extent, loop_vars is empty but the buffer may still require multiple indices (e.g., shape (1, 1, 1) needs 3 indices). The existing MakeIndices method correctly handles this by using ranges[i]->min for each dimension.

🐛 Proposed fix

```diff
 For AtomicOpBaseNode::MakeSIMTLoop(arith::Analyzer *analyzer) const {
   Array<IterVar> loop_vars = MakeIterVars();
   bool is_scalar = loop_vars.empty();
   if (is_scalar) {
+    Array<PrimExpr> src_indices = MakeIndices(loop_vars, 0);
+    Array<PrimExpr> dst_indices = MakeIndices(loop_vars, 1);
     return For(Var("i"), 0, 1, ForKind::kSerial,
-               BufferStore(dst, BufferLoad(src, {0}), {0}));
+               BufferStore(dst, BufferLoad(src, src_indices), dst_indices));
   }
```

Note: The same issue exists in AtomicAddNode::MakeSIMTLoop (lines 145-150 in atomic_add.cc).
🧹 Nitpick comments (1)
src/op/atomic_add.cc (1)
93-103: Extract shared vectorization-length logic into a utility function.

The GetVectorizeLength method in AtomicAddNode duplicates the logic of GetMaxVectorSize in src/transform/atomicadd_vectorize.cc (lines 64-73). Both implement identical vectorization-length determination (float16/bfloat16 → 2, float32 with SM≥90 → 4, else → 1). Extract this into a shared utility to follow the DRY principle.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
src/op/atomic_add.cc
src/op/atomic_add.h
src/op/atomic_reduce.cc
src/op/copy.cc
src/op/fill.cc
src/transform/loop_vectorize.cc
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2026-01-12T07:25:35.591Z
Learnt from: silentCoder-dev
Repo: tile-ai/tilelang PR: 1631
File: src/transform/thread_storage_sync.cc:1126-1137
Timestamp: 2026-01-12T07:25:35.591Z
Learning: In TileLang's thread storage synchronization pass (src/transform/thread_storage_sync.cc), at the IR level where PointerAccessIsDisjoint is called, the threads array in AccessEntry is guaranteed to contain all three thread dimensions (threadIdx.x, threadIdx.y, threadIdx.z), making access to the last 3 elements via `threads[threads.size() + idx - 3]` safe.
Applied to files:
src/op/atomic_reduce.cc
🧬 Code graph analysis (6)
src/op/copy.cc (2)
- src/transform/loop_vectorize.cc (4): VectorizeLoop (489-500), VectorizeLoop (489-490), VectorizeLoop (502-512), VectorizeLoop (502-503)
- tilelang/transform/__init__.py (1): VectorizeLoop (334-342)

src/op/atomic_add.cc (5)
- tilelang/language/ast/ir.py (1): target (1677-1707)
- src/transform/atomicadd_vectorize.cc (2): dtype (64-73), dtype (64-64)
- src/target/utils.cc (2): TargetHasSMVersionGE (130-135), TargetHasSMVersionGE (130-130)
- tilelang/ir.py (1): ParallelOp (60-60)
- src/transform/loop_partition.cc (2): LowerParallelLoop (274-310), LowerParallelLoop (274-277)

src/transform/loop_vectorize.cc (2)
- src/op/utils.h (1): IsGlobalBuffer (55-57)
- src/target/utils.cc (2): TargetHasSMVersionGE (130-135), TargetHasSMVersionGE (130-130)

src/op/atomic_reduce.cc (2)
- src/op/atomic_add.cc (6): Clone (72-78), Clone (72-72), GetElemOp (80-80), GetElemOp (80-80), MakeSIMTLoop (146-210), MakeSIMTLoop (146-146)
- src/op/atomic_add.h (1): GetElemOp (33-44)

src/op/atomic_add.h (2)
- src/op/atomic_reduce.h (4): tl (13-135), AtomicOpBaseNode (24-134), GetElemOp (77-88), GetElemOp (109-120)
- src/op/atomic_add.cc (10): Lower (349-586), Lower (349-349), InferLayout (256-312), InferLayout (256-257), GetElemOp (80-80), GetElemOp (80-80), GetVectorizeLength (93-103), GetVectorizeLength (93-93), MakeSIMTLoop (146-210), MakeSIMTLoop (146-146)

src/op/fill.cc (1)
- src/transform/loop_vectorize.cc (4): VectorizeLoop (489-500), VectorizeLoop (489-490), VectorizeLoop (502-512), VectorizeLoop (502-503)
🪛 Cppcheck (2.19.0)
src/op/atomic_reduce.cc
[error] 275-275: syntax error
(syntaxError)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
- GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
🔇 Additional comments (19)
src/op/copy.cc (1)
745-748: LGTM: LowerParallelLoop call updated correctly.

The call now properly passes T.layout_map and the predicate from par_op->GetPredicate(T.thread_var), aligning with the updated function signature that threads layout information through the lowering pipeline.

src/transform/loop_vectorize.cc (4)
43-71: LGTM: Well-implemented contiguity check.

The IsBufferContiguous function correctly validates row-major layout by checking that strides match the expected pattern (rightmost stride = 1, each stride = next stride × shape). The empty-strides case correctly returns true for implicitly contiguous buffers.
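As a sketch of that stride rule (assumed helper shape; the real check is at lines 43-71 of loop_vectorize.cc):

```cpp
// A buffer is row-major contiguous when, walking dimensions right to left,
// each stride equals the product of all shapes to its right.
bool IsBufferContiguousSketch(const Buffer& buf) {
  if (buf->strides.empty()) return true;  // no strides: implicitly contiguous
  arith::Analyzer analyzer;
  PrimExpr expected = make_const(buf->strides[0].dtype(), 1);
  for (int i = static_cast<int>(buf->strides.size()) - 1; i >= 0; --i) {
    if (!analyzer.CanProveEqual(buf->strides[i], expected)) return false;
    expected = expected * buf->shape[i];
  }
  return true;
}
```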
182-218: LGTM: Atomic add vectorization logic is sound.

The handling for atomic_add_elem_op correctly:

- Validates at least 2 arguments
- Extracts the dtype from the address_of(BufferLoad) pattern
- Applies appropriate vectorize lengths (2 for fp16/bf16, 4 for fp32 on SM≥90, 1 otherwise)
- Avoids traversing pointer-type arguments that cannot be vectorized

The fix from the past review (using ffi::GetRef<PrimExpr>(node) at line 215) has been correctly applied.
234-265: LGTM: Layout-aware index transformation.

The logic correctly handles layout transformation when a layout_map entry exists:
- Forwards indices through the layout
- Computes linear offset when dimensions don't match
- Decomposes back to buffer shape dimensions
The contiguity check at line 236 ensures this transformation is only applied to row-major buffers.
382-390: LGTM: API functions updated consistently.

The GetVectorizeSize and VectorizeLoop overloads are consistently updated to accept and propagate layout_map through the VectorizePlanner. This maintains API consistency across the vectorization pipeline.

Also applies to: 489-512
src/op/fill.cc (1)
170-171: LGTM: Consistent VectorizeLoop API updates.

All three VectorizeLoop calls are correctly updated to pass T.layout_map as the second argument, matching the new function signature. The changes are consistent across all buffer-scope paths (fragment, local, shared/global).

Also applies to: 179-180, 194-195
src/op/atomic_add.cc (2)
45-47: LGTM: Good input validation.

The added argument-count check ensures AtomicAdd receives at least 2 arguments (src and dst), providing a clear error message if violated. This is good defensive coding.
564-585: LGTM: Clean lowering pipeline implementation.

The non-TMA lowering path correctly:

- Creates a SIMT loop and fuses parallel loops
- Wraps it in ParallelOp for layout inference
- Runs inference at multiple levels (Common, Strict, Free)
- Uses LowerParallelLoop with the new layout_map parameter

This aligns with the broader PR goal of threading layout information through the lowering pipeline.
src/op/atomic_add.h (3)
16-21: LGTM: Clean inheritance from AtomicOpBaseNode.

The class correctly inherits from AtomicOpBaseNode, gaining shared functionality for atomic operations (MakeIterVars, MakeIndices, MakePredicate, GetMemoryOrder) while adding TMA-specific extensions.
26-33: LGTM: Override declarations are correct. The `Lower`, `InferLayout`, and `GetElemOp` overrides are properly declared to add TMA support and provide the element operation for atomic add. The previous issue about an unused `LowerTMA` declaration has been addressed.
36-44: Reflection registration pattern is consistent with codebase conventions. The inherited members (`src`, `dst`, `src_range`, `dst_range`, `annotations`) are explicitly re-registered in the reflection definition, which matches the established pattern used in `AtomicMaxNode`, `AtomicMinNode`, and other operator classes throughout the codebase. This is the standard way TVM's reflection system handles inherited members; no verification concerns.

src/op/atomic_reduce.cc (8)
29-55: LGTM! Bounds check properly added. The `ICHECK` for argument count addresses the previous review feedback. The implementation follows the same pattern as `AtomicAddNode` in atomic_add.cc.
61-87: LGTM! Implementation is consistent with AtomicMax. The bounds check is properly added, and the implementation mirrors the AtomicMax structure correctly.
93-105: LGTM! The iteration variable generation correctly skips unit-extent dimensions. The variable naming scheme is adequate for typical use cases.
107-155: LGTM! `MakeIndices` correctly handles multi-dimensional buffers with unit-extent dimensions by using `ranges[i]->min`. `MakePredicate` appropriately uses symbolic analysis to eliminate provably-true conditions.
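A small Python sketch of that unit-extent handling (illustrative; the real code builds TIR expressions rather than plain ints):

```python
# Unit-extent dims are pinned at their range's min; every other dim consumes
# the next iteration variable, offset by that range's min.
def make_indices(ranges, iter_vars):
    indices, k = [], 0
    for lo, extent in ranges:
        if extent == 1:
            indices.append(lo)
        else:
            indices.append(lo + iter_vars[k])
            k += 1
    return indices

# The middle dim has extent 1, so only two itervars are consumed:
assert make_indices([(0, 4), (2, 1), (0, 8)], [10, 20]) == [10, 2, 20]
```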
178-215: LGTM! The non-scalar path correctly generates SIMT loops with proper indices, atomic call construction, and parallel loop nesting. The polymorphic `GetElemOp()` call ensures the appropriate element operation is used.
217-233: LGTM! Layout inference correctly validates that source and destination fragment buffers have matching layouts, preventing subtle data corruption bugs.
235-259: LGTM! The lowering pipeline correctly creates a SIMT loop, fuses it, performs multi-level layout inference via `ParallelOp`, and lowers to the final representation.
265-278: LGTM! Registration and reflection setup is correct. The operator registration with `kOpaque` call effect is appropriate for atomic operations. The static analysis syntax error on line 275 is a false positive: `TVM_FFI_STATIC_INIT_BLOCK` is a valid TVM macro for static initialization.
@regression-perf
Performance Regression Test Report

Triggered by: @LeiWang1999

Results
Artifacts
@regression-perf
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/transform/loop_vectorize.cc`:
- Around line 182-207: The comment for handling atomic_add_elem_op incorrectly
says args are not visited, but the code calls
arith::IRMutatorWithAnalyzer::VisitExpr_(node) which does visit args; either
stop visiting args by returning the original call expression immediately in the
atomic_add_elem_op branch (i.e., avoid calling VisitExpr_ so pointer args are
not traversed) or keep the current behavior and update the comment to state that
args are intentionally visited; modify the branch handling atomic_add_elem_op
(the block examining address_of, BufferLoadNode, dtype and setting vector_size_)
to implement one of these two fixes so the code and comment are consistent.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/transform/loop_vectorize.cc
🧰 Additional context used
🧬 Code graph analysis (1)
src/transform/loop_vectorize.cc (2)
src/op/utils.h (1)
IsGlobalBuffer (55-57)

src/target/utils.cc (2)

TargetHasSMVersionGE (130-135), TargetHasSMVersionGE (130-130)
🔇 Additional comments (7)
src/transform/loop_vectorize.cc (7)
43-71: LGTM! The `IsBufferContiguous` function correctly validates row-major contiguity by checking that strides follow the expected pattern: `strides[n-1] = 1` and `strides[i] = strides[i+1] * shape[i+1]`. The use of `analyzer->CanProveEqual` for symbolic comparison is appropriate.
106-108: LGTM! The constructor update cleanly adds optional `LayoutMap` support while maintaining backward compatibility through the default empty map.
152-171: LGTM! Good refactor to use the `IsSharedBuffer`/`IsGlobalBuffer` utility functions instead of inline scope string comparisons. This improves maintainability and consistency with other parts of the codebase.
238-272: LGTM with a minor note. The layout map handling logic is correct:
- Index forwarding through layout
- Linear offset computation and decomposition when dimensions differ
- Proper reversal of decomposed indices
The `ICHECK` on line 244 provides a clear assertion for the contiguity requirement, which is appropriate for catching developer errors early.
389-397: LGTM! The `GetVectorizeSize` overloads correctly thread the `LayoutMap` parameter to `VectorizePlanner`, maintaining a clean and consistent API.
496-519: LGTM! The `VectorizeLoop` overloads properly propagate the `LayoutMap` parameter, maintaining consistency with the `GetVectorizeSize` functions and backward compatibility.
212-219: LGTM! Explicitly allowing bitwise operations to be vectorized is appropriate since these operations are inherently element-wise and can benefit from SIMD vectorization.
```cpp
} else if (node->op == tl::atomic_add_elem_op()) {
  // Assert at least 2 args (dst_ptr and src)
  ICHECK(node->args.size() >= 2)
      << "atomic_add_elem_op requires at least 2 args (dst and src)";

  // Get dst dtype from args[0] (address_of call containing BufferLoad)
  auto address_of_call = node->args[0].as<CallNode>();
  ICHECK(address_of_call && address_of_call->op == builtin::address_of())
      << "atomic_add_elem_op first arg must be address_of call";

  auto buffer_load = address_of_call->args[0].as<BufferLoadNode>();
  ICHECK(buffer_load) << "address_of arg must be BufferLoad";

  DataType dtype = buffer_load->buffer->dtype;
  int vectorize_length = 1;
  if (dtype.is_float16() || dtype.is_bfloat16()) {
    vectorize_length = 2;
  } else if (dtype.is_float() && dtype.bits() == 32 &&
             TargetHasSMVersionGE(Target::Current(false), 90)) {
    vectorize_length = 4;
  }

  vector_size_ = arith::ZeroAwareGCD(vector_size_, vectorize_length);
  // Do not visit the args of atomic_add_elem_op, because pointer type
  // is impossible to vectorize
  return arith::IRMutatorWithAnalyzer::VisitExpr_(node);
```
Comment contradicts behavior: args are still visited.
The comment on lines 205-206 states "Do not visit the args of atomic_add_elem_op, because pointer type is impossible to vectorize," but line 207 calls `arith::IRMutatorWithAnalyzer::VisitExpr_(node)`, which will recursively visit all arguments.
If the intent is to skip visiting args (as done for "Other calls" on line 223), this should return early without calling the base class visitor:
🔧 Suggested fix to skip arg visitation
```diff
  vector_size_ = arith::ZeroAwareGCD(vector_size_, vectorize_length);
  // Do not visit the args of atomic_add_elem_op, because pointer type
  // is impossible to vectorize
- return arith::IRMutatorWithAnalyzer::VisitExpr_(node);
+ return ffi::GetRef<PrimExpr>(node);
```

Alternatively, if visiting args is intentional, update the comment to reflect the actual behavior.
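For readers less familiar with the visitor pattern at play, here is a minimal Python analogue using the standard `ast` module as a stand-in for TVM's IRMutator:

```python
import ast

# Returning the node as-is skips its children; delegating to generic_visit
# (the analogue of the IRMutator base-class VisitExpr_) recurses into them.
class SkipAtomicArgs(ast.NodeTransformer):
    def visit_Call(self, node: ast.Call):
        if isinstance(node.func, ast.Name) and node.func.id == "atomic_add_elem_op":
            return node                    # early return: args are NOT traversed
        return self.generic_visit(node)    # default path: traverses all args

tree = ast.parse("atomic_add_elem_op(address_of(dst), src)")
SkipAtomicArgs().visit(tree)  # the inner address_of call is left untouched
```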
Performance Regression Test Report

Triggered by: @LeiWang1999

Results
Artifacts
Summary
This PR introduces atomic reduction operations (AtomicMax, AtomicMin) and enhances the existing atomic add functionality with automatic vectorization support and code architecture improvements.
Key Changes
- Added `T.atomic_max` and `T.atomic_min` for both element-wise and tile-region-based operations
- New vectorization pass converts `atomic_add_elem_op` to `atomic_addx2_elem_op` (for float16/bfloat16) or `atomic_addx4_elem_op` (for float32 on SM>=90)
- Refactored `AtomicAddNode` to inherit from `AtomicOpBaseNode<T>` using CRTP pattern, reducing code duplication
- Added `return_prev=True` parameter to retrieve the previous value in atomic operations (`atomic_add_ret_elem_op`, `atomic_max_ret_elem_op`, `atomic_min_ret_elem_op`)

New APIs
Example Usage
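A rough sketch of how the APIs named above might be used (`T.atomic_max`, `T.atomic_min`, `return_prev=True`); the kernel scaffolding and exact signatures here are assumptions, not code taken from this PR's tests:

```python
import tilelang.language as T

# Hypothetical sketch: the scaffolding (T.Kernel, T.Tensor) follows common
# tilelang patterns; argument order and signatures may differ in the real API.
@T.prim_func
def main(A: T.Tensor((128, 128), "float32"), B: T.Tensor((128, 128), "float32")):
    with T.Kernel(1, threads=128):
        # Tile-region forms: atomically reduce a whole region into the dst tile.
        T.atomic_add(B, A)
        T.atomic_max(B, A)
        T.atomic_min(B, A)
        # Element-wise form: return_prev=True yields the value B[0, 0] held
        # immediately before the update.
        prev = T.atomic_add(B[0, 0], 1.0, return_prev=True)
```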
Vectorization Pass
The new vectorization pass automatically detects `ForKind::kVectorized` loops containing `atomic_add_elem_op` and converts them to vectorized versions:

- `float16`/`bfloat16`: Uses `atomic_addx2_elem_op` (2 elements per operation)
- `float32` on SM>=90: Uses `atomic_addx4_elem_op` (4 elements per operation)

Architecture Improvements
- Introduced `AtomicOpBaseNode<Derived>` template base class using CRTP pattern
- Shared functionality includes `MakeIterVars`, `MakeIndices`, `MakePredicate`, `MakeSIMTLoop`, `Lower`, and `InferLayout`
- `AtomicAddNode`, `AtomicMaxNode`, and `AtomicMinNode` all inherit from this base class

Test plan
- `test_atomic_add` - Basic atomic add
- `test_atomic_max` - Atomic maximum
- `test_atomic_min` - Atomic minimum
- `test_atomic_load_store` - Atomic load/store operations
- `test_atomic_memory_order` - Memory ordering support
- `test_atomic_addx2_half` - Vectorized atomic add for float16
- `test_atomic_addx2_float` - Vectorized atomic add for float32
- `test_atomic_addx4` - 4-wide vectorized atomic add
- `test_atomic_return_prev` - Return previous value
- `test_tile_atomic_add` - Tile-region atomic add
- `test_tile_atomic_max` - Tile-region atomic max
- `test_tile_atomic_min` - Tile-region atomic min
- `test_tma_atomic_add` - TMA-based atomic add

🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Vectorization
Codegen
Tests